These notes are produced using RMarkdown. Open the document in RStudio and try to execute the code. If you are a new user of ggplot2, first execute the following commands to install the required packages.
install.packages("Hmisc")
install.packages("ggplot2")
install.packages("ggthemes")
install.packages("ggrepel") # needed for geom_text_repel() later in these notes
The examples follow the book on ggplot2 by Hadley Wickham; more topical datasets will be studied in the upcoming lectures.
ggplot2 is a graphics library for the R programming language designed by Hadley Wickham around 2010 (the first version of ggplot is from 2005).
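ggplot2 allows users to compose visualizations based on a set of principles. A graphic is composed of data, aesthetic mappings (aes), geometric objects (geoms), statistical transformations (stats), position adjustments, and facets.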
ggplot2 requires data to be stored in an R data.frame object. Data frames can be built from CSV files (e.g. by using the read.csv function) or from scratch.
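As a small sketch of both routes, the following builds a data frame by hand and then round-trips it through a temporary CSV file. The name `toy` and its columns are made up for illustration and are not part of the lecture data.

```r
# A hypothetical toy data set; the column names are illustrative only.
toy <- data.frame(
  model = c("a", "b", "c"),
  hwy   = c(29L, 31L, 26L),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV file and read it back with read.csv().
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
toy_csv <- read.csv(path, stringsAsFactors = FALSE)

str(toy_csv)  # compact view of the data frame that was read back
```

Either object can then be passed as the data argument of a plotting call.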
For this lecture we will start with a data frame that ships with ggplot2, namely mpg. This dataset contains a subset of the fuel economy data that the EPA makes available. It contains only models which had a new release every year between 1999 and 2008.
Typing ?mpg in the console will give more details.
Before we start visualizing the data, let's look at it from the console. The function str gives information about the structure of an object in a compact way that is often informative.
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
We can print the first ten rows of the dataset.
mpg[1:10, ]
## # A tibble: 10 × 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 2 more variables: fl <chr>, class <chr>
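A common shorthand for the row indexing above is head(); dim() and summary() are other base R helpers that give a quick overview. This is only a sketch, and it assumes ggplot2 is installed, since that is where mpg lives.

```r
library(ggplot2)  # provides the mpg data set

first_ten <- head(mpg, 10)  # same rows as mpg[1:10, ]
first_ten

dim(mpg)          # number of rows and number of columns
summary(mpg$hwy)  # minimum, quartiles, mean and maximum of hwy
```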
Typing ?ggplot shows how to invoke ggplot.
The simplest use would be to specify only the data.
ggplot(data = mpg)
As we can see, this does not render anything, as we have not specified which aspect of the data we want to display.
The first thing we can specify is an aesthetic mapping stating that our x axis will be the class attribute (whether a car is a compact or any other kind).
ggplot(mpg, mapping = aes(x = class))
As we can see, this renders each class and adds an axis label. Still, no data.
Observe that we did not have to specify that the first argument is a data set. Similarly, we can omit mapping and x (and we will, as long as they are clear from context).
ggplot(mpg, aes(class))
Let’s start with a simple bar plot that shows the number of individual vehicles in each class.
ggplot(mpg, aes(class)) +
geom_bar()
Clearly any other variable can be graphed, such as drv.
ggplot(mpg, aes(drv)) + geom_bar()
We can also present another variable on the y axis, for example the highway miles per gallon, or hwy.
ggplot(mpg, aes(x=class, y=hwy)) +
geom_bar()
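Not all pairings of aes and geom make sense. The call above fails because geom_bar() counts observations by default and therefore does not accept both an x and a y aesthetic. Let’s find a geom that works with two variables.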
ggplot(mpg, aes(class, hwy)) +
geom_point()
Note that length(mpg$class) is 234: there are 234 observations, but we can’t see 234 points, because observations with the same class and hwy values are drawn on top of each other. We will return to this later.
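One way to make the overlap visible, sketched here under the assumption that ggplot2 is loaded, is to count the distinct (class, hwy) positions and then spread the points with geom_jitter(). The names `positions` and `p_jitter` and the jitter width are choices made for this illustration.

```r
library(ggplot2)

# Each distinct (class, hwy) pair corresponds to one visible dot.
positions <- unique(as.data.frame(mpg[, c("class", "hwy")]))
nrow(mpg)        # number of observations
nrow(positions)  # number of distinct positions, smaller than the above

# Jittering adds a little horizontal noise so overlapping points separate.
p_jitter <- ggplot(mpg, aes(class, hwy)) +
  geom_jitter(width = 0.2, height = 0)
p_jitter
```

Setting height = 0 keeps the hwy values exact, so only the horizontal placement within each class is randomized.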
We can use a statistic to summarize the data. Here we plot the mean highway miles per gallon for each car class.
ggplot(mpg, aes(class, hwy)) +
  geom_bar(stat = "summary",
           fun = mean) # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
Other summaries can be applied. The following graph shows the median, which is not too different from the mean in this case.
ggplot(mpg, aes(class, hwy)) +
  geom_bar(stat = "summary",
           fun = median)
Finally, this last graph shows the max, which has a similar shape, but the y-axis values differ.
ggplot(mpg, aes(class, hwy)) +
  geom_bar(stat = "summary",
           fun = max)
Imagine that we want to plot three variables: class, hwy and cty. How could we do this? We can use an aes where the color differentiates between highway and city consumption. To build such a graphic we need to reshape our dataset so that cty and hwy entries are on different rows.
We can use base R operations to do this in a few simple steps. First we extract the columns we are interested in; hwy is a new data frame.
hwy <- mpg[, c("class", "hwy")] # extract two columns
hwy[1:10, ] #print
## # A tibble: 10 × 2
## class hwy
## <chr> <int>
## 1 compact 29
## 2 compact 29
## 3 compact 31
## 4 compact 30
## 5 compact 26
## 6 compact 26
## 7 compact 27
## 8 compact 26
## 9 compact 25
## 10 compact 28
Next we add a column to the data frame; that column holds the character string hwy for all observations.
hwy <- cbind(hwy, "hwy") # cbind adds one column to the data frame
hwy[1:5, ]
## class hwy "hwy"
## 1 compact 29 hwy
## 2 compact 29 hwy
## 3 compact 31 hwy
## 4 compact 30 hwy
## 5 compact 26 hwy
Now we rename the columns (this is needed because rbind matches the columns of data frames by name).
names(hwy) <- c("class", "val", "type")
hwy[ 1:5, ]
## class val type
## 1 compact 29 hwy
## 2 compact 29 hwy
## 3 compact 31 hwy
## 4 compact 30 hwy
## 5 compact 26 hwy
We do the same for the cty variable.
cty <- mpg[, c("class", "cty")]
cty <- cbind(cty, "cty")
names(cty) <- c("class", "val", "type")
Finally we can merge the two data frames.
ds <- rbind(hwy, cty) ## rbind stacks the two data frames row-wise
ds[sample(1:nrow(ds), 10),] ## print 10 random entries
## class val type
## 230 midsize 28 hwy
## 48 minivan 23 hwy
## 419 midsize 18 cty
## 455 compact 17 cty
## 281 minivan 16 cty
## 287 pickup 14 cty
## 346 midsize 21 cty
## 386 suv 15 cty
## 362 suv 14 cty
## 124 suv 19 hwy
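The same long format can also be built in a single step. The following is a sketch of an alternative to the cbind/rbind sequence, not a replacement for it; `ds_alt` is a name chosen here for illustration, and ggplot2 is assumed to be loaded for mpg.

```r
library(ggplot2)

# Stack the hwy and cty columns on top of each other and label each half.
ds_alt <- data.frame(
  class = rep(mpg$class, times = 2),
  val   = c(mpg$hwy, mpg$cty),
  type  = rep(c("hwy", "cty"), each = nrow(mpg))
)

nrow(ds_alt)        # twice the number of rows in mpg
table(ds_alt$type)  # one block of rows per type
```

The step-by-step version is easier to inspect while learning; the one-step version is shorter once the target shape is clear.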
Now that the data is in the right shape, let’s plot it. We use the fill color to differentiate between types of consumption.
ggplot(ds, aes(class, val, fill=type)) +
  geom_bar(stat="summary",
           fun=mean)
This is not ideal: the bars are stacked, so it is difficult to compare the relative heights of the top bars.
We can control the position of the bars.
ggplot(ds, aes(class, val, fill=type)) +
  geom_bar(stat="summary",
           fun=mean,
           position="dodge")
This allows us to compare the relative heights of each kind of consumption.
geoms <- help.search("^geom_", package = "ggplot2")
unique(geoms$matches[, 1:2])
## Topic
## 1 geom_abline
## 4 geom_bar
## 6 geom_bin2d
## 7 geom_blank
## 8 geom_boxplot
## 9 geom_contour
## 10 geom_count
## 11 geom_density
## 12 geom_density_2d
## 14 geom_dotplot
## 15 geom_errorbarh
## 16 geom_hex
## 17 geom_freqpoly
## 19 geom_jitter
## 20 geom_crossbar
## 24 geom_map
## 25 geom_path
## 28 geom_point
## 29 geom_polygon
## 30 geom_qq
## 31 geom_quantile
## 32 geom_ribbon
## 34 geom_rug
## 35 geom_segment
## 37 geom_smooth
## 38 geom_spoke
## 39 geom_label
## 41 geom_raster
## 44 geom_violin
## Title
## 1 Reference lines: horizontal, vertical, and diagonal
## 4 Bars charts
## 6 Heatmap of 2d bin counts
## 7 Draw nothing
## 8 A box and whiskers plot (in the style of Tukey)
## 9 2d contours of a 3d surface
## 10 Count overlapping points
## 11 Smoothed density estimates
## 12 Contours of a 2d density estimate
## 14 Dot plot
## 15 Horizontal error bars
## 16 Hexagonal heatmap of 2d bin counts
## 17 Histograms and frequency polygons
## 19 Jittered points
## 20 Vertical intervals: lines, crossbars & errorbars
## 24 Polygons from a reference map
## 25 Connect observations
## 28 Points
## 29 Polygons
## 30 A quantile-quantile plot
## 31 Quantile regression
## 32 Ribbons and area plots
## 34 Rug plots in the margins
## 35 Line segments and curves
## 37 Smoothed conditional means
## 38 Line segments parameterised by location, direction and distance
## 39 Text
## 41 Rectangles
## 44 Violin plot
stats <- help.search("^stat_", package = "ggplot2", fields="name")
stats$matches[, 1:2]
## Topic Title
## 1 stat_ecdf Compute empirical cumulative distribution
## 2 stat_ellipse Compute normal confidence ellipses
## 3 stat_function Compute function for each x value
## 4 stat_identity Leave data as is
## 5 stat_summary_bin Summarise y values at unique/binned x
## 6 stat_summary_2d Bin and summarise in 2d (rectangle & hexagons)
## 7 stat_unique Remove duplicates
unique(help.search("^position_", package = "ggplot2")$matches[,1:2])
## Topic Title
## 1 position_dodge Dodge overlapping objects side-to-side
## 2 position_identity Don't adjust position
## 3 position_jitter Jitter points to avoid overplotting
## 4 position_jitterdodge Simultaneously dodge and jitter
## 5 position_nudge Nudge points a fixed distance
## 6 position_stack Stack overlapping objects on top of each another
We start by showing how to compute error bars and add them manually, and then demonstrate how to leverage a library that will do the same for us.
What we need is to compute the confidence interval for each class of car and for both cty and hwy driving conditions. The data frame has all the information we need, but it is jumbled up. The mean_cl_boot() function (which relies on the Hmisc package) computes a bootstrapped confidence interval of the mean. We need to give it the right subset of the data.
Here is what we would do for cty and suv:
suvs <- ds[ ds$class=="suv", ]
vals <- suvs[ suvs$type == "cty", "val"]
mean_cl_boot(vals)
## y ymin ymax
## 1 13.5 12.95121 14.08105
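To make the idea behind a bootstrap confidence interval concrete, here is a minimal base R sketch of the percentile bootstrap. It illustrates the general technique and is not Hmisc's implementation; `boot_mean_ci`, its arguments, and the example vector are hypothetical names and values chosen for illustration.

```r
# Percentile bootstrap for the mean: resample with replacement, recompute
# the mean each time, and take quantiles of the resampled means.
boot_mean_ci <- function(x, B = 1000, conf = 0.95) {
  means <- replicate(B, mean(sample(x, replace = TRUE)))
  alpha <- (1 - conf) / 2
  data.frame(
    y    = mean(x),
    ymin = unname(quantile(means, alpha)),
    ymax = unname(quantile(means, 1 - alpha))
  )
}

set.seed(1)                             # make the resampling reproducible
x <- c(14, 11, 14, 13, 12, 16, 15, 13)  # made-up city mileage values
ci <- boot_mean_ci(x)
ci
```

Because the resampling is random, the bounds change slightly from run to run unless a seed is set.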
What we want to do is repeat this code for each class and driving condition. But that would be too much boilerplate code, and too likely to contain errors. Instead we can use lapply() to do it in a generic manner.
ds2 <- lapply(unique(ds$class), function(x) {
v <- ds[ ds$class==x, ]
vals <- v[ v$type == "cty", "val"]
cty <- mean_cl_boot(vals)
cty["type"] <- "cty"
vals <- v[ v$type == "hwy", "val"]
hwy <- mean_cl_boot(vals)
hwy["type"] <- "hwy"
df <- rbind(cty, hwy)
df["class"]<-x
df
})
ds2 <- do.call(rbind, ds2)
ds2
## y ymin ymax type class
## 1 20.12766 19.21223 21.19202 cty compact
## 2 28.29787 27.27660 29.38298 hwy compact
## 3 18.75610 18.17073 19.36585 cty midsize
## 4 27.29268 26.65854 27.95122 hwy midsize
## 5 13.50000 12.90323 14.11331 cty suv
## 6 18.12903 17.43548 18.88750 hwy suv
## 7 15.40000 15.00000 15.80000 cty 2seater
## 8 24.80000 23.80000 25.80000 hwy 2seater
## 9 15.81818 14.63636 16.72727 cty minivan
## 10 22.36364 21.00000 23.36364 hwy minivan
## 11 13.00000 12.30303 13.66742 cty pickup
## 12 16.87879 16.09091 17.66667 hwy pickup
## 13 20.37143 18.88571 21.88571 cty subcompact
## 14 28.14286 26.34286 30.00000 hwy subcompact
Now we can graph the resulting data. We use the identity stat since we are not asking ggplot to perform any computation. We specify that the fill is the type variable; ggplot picks the colors automatically.
ggplot(ds2, aes(class, y, fill=type)) +
geom_bar(stat="identity",
position="dodge") +
geom_errorbar( aes(ymin = ymin, ymax = ymax),
position = position_dodge(.9),
width = .2)
In this particular case, the reshaping of the data can be avoided as it is possible to pass the mean_cl_boot() function to stat_summary directly.
ggplot(ds, aes(class, val, fill=type)) +
  geom_bar(stat="summary", fun=mean, position="dodge") +
stat_summary(fun.data=mean_cl_boot, color="black", geom="errorbar", position=position_dodge(.9), width=.2)
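stat_summary() accepts any function through fun.data, as long as it returns a data frame with the columns y, ymin and ymax. As a sketch, the hypothetical helper `mean_sd` below draws the mean plus or minus one standard deviation instead of a bootstrap interval; ggplot2 is assumed to be loaded.

```r
library(ggplot2)

# A hypothetical summary function: mean plus/minus one standard deviation.
# stat_summary() expects fun.data to return columns y, ymin and ymax.
mean_sd <- function(x) {
  data.frame(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

p_sd <- ggplot(mpg, aes(class, hwy)) +
  stat_summary(fun.data = mean_sd, geom = "pointrange")
p_sd
```

The function is called once per group, here once per class, with the y values of that group.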
The mtcars dataset has names for every row. It is thus possible to use those names as labels of points in a graph.
m <- mtcars[1:10,]
ggplot(m, aes(mpg, wt)) +
  geom_point() +
  geom_text( aes( label = rownames(m)), nudge_x = .1, nudge_y = -.1, check_overlap = FALSE )
m <- mtcars[1:20,]
ggplot(m, aes(mpg, wt)) +
geom_point() +
geom_text( aes( label = rownames(m)),
position=position_jitter(width=.1, height=.2) )
Clearly there are too many points for this to be legible. We can reduce the size a bit.
ggplot(mtcars, aes(mpg, wt)) +
geom_point() +
geom_text(aes(label=rownames(mtcars)), size=2)
library(ggrepel)
m <- mtcars
ggplot(m, aes(mpg, wt)) +
geom_point() +
  geom_text_repel( aes( label = rownames(m)) ) # ggrepel moves the labels apart itself; check_overlap is not one of its parameters
Another thing we can do is to use the weight of the car to determine the size of the text.
ggplot(mtcars, aes(mpg, wt)) +
geom_point() +
geom_text_repel(aes(label=rownames(mtcars), size=wt))
We can also put labels on the figure.
ggplot(mtcars, aes(mpg, wt)) +
geom_point() +
annotate("text", label = "plot mpg vs. wt", x = 18, y = 5, size = 8, colour = "red")
ggplot(mpg, aes(class, cty)) +
geom_bar(stat="identity") +
geom_text(aes(label=cty), color="white")
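This is not quite what we want: with stat="identity" every observation is stacked, so each bar shows the sum of cty within a class and one label is drawn per observation. For example, the 2seater class has five observations: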
mpg[mpg$class=="2seater", "cty"]
## # A tibble: 5 × 1
## cty
## <int>
## 1 16
## 2 15
## 3 16
## 4 15
## 5 15
ds2 <- lapply(unique(mpg$class), function(x) {
  v <- ds[ds$class==x, ] # index ds by its own class column (mpg$class has half as many rows)
cty <- mean(v[v$type == "cty", "val"])
data.frame(class=x, cty)
})
ds2 <- do.call(rbind, ds2)
ds2
## class cty
## 1 compact 20.12766
## 2 midsize 18.75610
## 3 suv 13.50000
## 4 2seater 15.40000
## 5 minivan 15.81818
## 6 pickup 13.00000
## 7 subcompact 20.37143
ggplot(ds2, aes(class, cty)) +
geom_bar(stat="identity") +
geom_text(aes(label=round(cty, digits=2)), color="white", vjust=2)
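The per-class means computed with lapply() above can also be obtained in one call. The following is a minimal sketch using base R's aggregate(); `cty_means` is a name chosen here for illustration, and ggplot2 is assumed to be loaded.

```r
library(ggplot2)

# Mean city mileage per class; the formula reads "cty grouped by class".
cty_means <- aggregate(cty ~ class, data = mpg, FUN = mean)
cty_means

ggplot(cty_means, aes(class, cty)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(cty, digits = 2)), color = "white", vjust = 2)
```

aggregate() returns one row per class, which is exactly the shape that stat = "identity" expects.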
p <- ggplot(mpg, aes(x=displ, y=hwy))
p + geom_point(aes(color=class, size=cyl)) +
  labs(title="Fuel economy in relation to engine displacement",
subtitle="Less is more",
caption="A caption",
x="engine displacement",
y="highway miles per gallon")
p <- ggplot(mpg, aes(x=displ, y=hwy))
p + geom_point(aes(color=class, size=cyl)) +
  labs(title="Fuel economy in relation to engine displacement",
subtitle="Less is more",
caption="A caption",
x="engine displacement",
y="highway miles per gallon") +
guides(color=guide_legend(title="Car class"),
size=guide_legend(title="Number of cylinders"))
p <- ggplot(mpg, aes(x=model, y=hwy))
p + geom_bar(stat="identity")
p <- ggplot(mpg, aes(x=model, y=hwy))
p + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90))
p <- ggplot(mpg, aes(x=model, y=hwy))
p + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
library(ggthemes)
p <- ggplot(mpg, aes(x=displ, y=hwy))
p + geom_point(aes(color=class, size=cyl))
p <- ggplot(mpg, aes(x=displ, y=hwy))
p + geom_point(aes(color=class, size=cyl)) +
theme_economist() +
scale_colour_economist()
p <- ggplot(mpg, aes(x=displ, y=hwy))
p + geom_point(aes(color=class, size=cyl)) +
theme_minimal()
p <- ggplot(mpg, aes(x=displ, y=hwy))
p + geom_point(aes(color=class, size=cyl)) +
theme_excel() + scale_colour_excel()
p <- ggplot(mpg, aes(displ, hwy))
p + geom_point() +
geom_smooth(method="lm")
p <- ggplot(mpg, aes(displ, hwy, color=factor(cyl)))
p + geom_point() + geom_smooth(method="lm")
p <- ggplot(mpg, aes(displ, hwy))
p + geom_point() + geom_smooth(method="loess")
facet_grid() takes a formula with the rows (of the tabular display) on the LHS and the columns (of the tabular display) on the RHS; a dot in the formula indicates that there should be no faceting on that dimension (either row or column).
y ~ x
x is the explanatory variable
y is the response variable
y ~ x + z
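A formula is an ordinary R object, which can be inspected from the console. The short sketch below uses only base R; the variable names x, y and z are placeholders and do not need to exist.

```r
# A formula stores unevaluated variable names.
f <- y ~ x + z
class(f)     # "formula"
all.vars(f)  # "y" "x" "z"

# In facet_grid(), the left-hand side names the row variable and the
# right-hand side the column variable; "." means no faceting on that side.
rows_only <- cyl ~ .
cols_only <- . ~ cyl
```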
p <- ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()
p + facet_grid(. ~ cyl)
p <- ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()
p + facet_grid(cyl ~ .)
p <- ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()
p + facet_grid(drv ~ cyl)
p <- ggplot(mtcars, aes(mpg, wt, colour = factor(cyl))) + geom_point()
p + facet_grid(. ~ cyl, scales = "free")